1 Introduction

This lab offers an introduction to some of the ideas in text analysis. The code here is not necessarily indicative of best practice, but it is hopefully helpful as a springboard for basic concepts.

We will primarily be using Ken Benoit’s quanteda package for text manipulation. There are several other fantastic options, including corpus, tidytext, and tm; stm, which we will use for topic modeling, also has some pre-processing tools. quanteda has a great vignette to help you get started, and Ken also has great exercises available from his course.

2 A Note About Citation

Most of these software packages are written by academics, and reliable, easy-to-use software is really difficult to make. If you use these packages in your published work, please cite them. In R you can even see how the author would like to be cited (and get a BibTeX entry).

citation("quanteda")
## 
## Benoit K (2017). _quanteda: Quantitative Analysis of Textual
## Data_. doi: 10.5281/zenodo.1004683 (URL:
## http://doi.org/10.5281/zenodo.1004683), R package version 0.99.22,
## <URL: http://quanteda.io>.
## 
## A BibTeX entry for LaTeX users is
## 
##   @Manual{,
##     title = {quanteda: Quantitative Analysis of Textual Data},
##     author = {Kenneth Benoit},
##     year = {2017},
##     doi = {10.5281/zenodo.1004683},
##     url = {http://quanteda.io},
##     note = {R package version 0.99.22},
##   }

3 Alternate Document Sets

In the folder I’ve included a number of alternative datasets that you can work with. The quanteda, stm, and mnir packages also have a number of additional text sources. Even better: use your own!

4 Preparation

We start with data downloaded from JSTOR’s Data For Research (DFR) database. I’ve pulled the abstracts for all articles published in the American Journal of Sociology (AJS) and American Sociological Review (ASR) for the years 1975-2013. DFR limits data pulls to 1000 documents at a time, so there are five tab-separated files we’ll be working with. These files are in a .zip file at the following link

First, load some of the general packages.

library(tidyverse)
library(quanteda)

5 Pre-processing Text

For each of the JSTOR data pulls, there is a citations.tsv file that has the tab-separated citation information for each document, including the text of the article’s abstract.

Unfortunately, the .tsv format contains a trailing tab (so annoying with tab-delimited data), so we need to read in the data and the column names separately; otherwise R gets too clever and we end up with mismatched names and data. To make this easier, we provide the following function to load the data. There is often some annoying thing you need to do to get your data loaded into R.

# function to load a DFR citations.tsv file
getJournalCites <- function(file.loc) {
  # read the header row separately from the data to dodge the trailing tab
  col.names <- unlist(read.delim(file.loc, header=F, stringsAsFactors = F, nrow=1))
  cites <- read.delim(file.loc, header=F, stringsAsFactors = F, skip=1) 
  # remove the trailing column, attach the names, and drop rows without abstracts
  cites <- select(cites, -14)
  names(cites) <- col.names
  cites <- filter(cites, nchar(abstract)>0)
  # pull the publication year out of the pubdate field
  cites$year <- format(as.Date(cites$pubdate), "%Y")
  return(cites)
}

# load each in turn
ajs.cites <- getJournalCites("data/ajs_1997_2013/citations.tsv")
ajs.cites.2 <- getJournalCites("data/ajs_75_96/citations.tsv")
asr.cites <- getJournalCites("data/asr_1997_2013/citations.tsv") 
asr.cites.2 <- getJournalCites("data/asr_85_96/citations.tsv") 
asr.cites.3 <- getJournalCites("data/asr_75_84/citations.tsv") 

# package into a single data frame
cites <- rbind(ajs.cites, ajs.cites.2, asr.cites, asr.cites.2, asr.cites.3)
cites$doc_id <- paste(cites$author, cites$year) #let's make a variable to identify it

Take a look at the data frame you just produced. For small to medium-sized document collections (let’s say less than a million documents), this “spreadsheet” format is often the easiest way to work with the data.
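If the files loaded correctly, a quick inspection might look like the following (a sketch using the cites object built above; glimpse() ships with the tidyverse):

```r
# quick sanity checks on the combined citation data frame
dim(cites)                 # rows = abstracts kept, columns = citation fields
dplyr::glimpse(cites)      # one line per column, with types and sample values
head(cites$abstract, 1)    # read an abstract or two in full
```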

The quanteda package contains some great tools for easily processing a set of documents into the document-term matrix representation we’ll use. First, we construct a corpus object from the loaded data. We’ll use the doc_id variable we just created (author plus year) for docnames instead of the “title” field, because titles are not 100% unique in these data. The remaining columns are carried along as document metadata (docvars) that we can use with the corpus later.

corpus <- corpus(cites, 
                 docid_field="doc_id", 
                 text_field="abstract")
summary(corpus,3)
## Corpus consisting of 3402 documents, showing 3 documents:
## 
##                                               Text Types Tokens Sentences
##  Paul Ingram, Jeffrey Robinson, Marc L. Busch 2005    97    154         5
##                               Michael Hechter 2004    73    139         6
##           Don Tomaskovic‐Devey, Sheryl Skaggs 2002   112    175         8
##              id            doi
##  10.1086/497350 10.1086/497350
##  10.1086/421357 10.1086/421357
##  10.1086/344214 10.1086/344214
##                                                                                          title
##  The Intergovernmental Network of World Trade: IGO Connectedness, Governance, and Embeddedness
##                                                                          From Class to Culture
##                    Sex Segregation, Labor Process Organization, and Gender Earnings Inequality
##                                        author
##  Paul Ingram, Jeffrey Robinson, Marc L. Busch
##                               Michael Hechter
##           Don Tomaskovic‐Devey, Sheryl Skaggs
##                   journaltitle volume issue              pubdate
##  American Journal of Sociology    111     3 2005-11-01T00:00:00Z
##  American Journal of Sociology    110     2 2004-09-01T00:00:00Z
##  American Journal of Sociology    108     1 2002-07-01T00:00:00Z
##    pagerange                       publisher type reviewed-work year
##  pp. 824-858 The University of Chicago Press  fla            NA 2005
##  pp. 400-445 The University of Chicago Press  fla            NA 2004
##  pp. 102-128 The University of Chicago Press  fla            NA 2002
## 
## Source:  /Users/brandonstewart/Dropbox/Slides/MichiganWorkshop2017/labs/* on x86_64 by brandonstewart
## Created: Thu Nov 30 21:44:27 2017
## Notes:

Next, we transform the corpus into a document feature matrix (also called a document-term matrix, document-token matrix, or document-word matrix) where each row is a document (article) and each column is a distinct term. As we’ll see below, this object can be used in many places as a document-term matrix. The dfm() function has a verbose= option that defaults to TRUE; setting it to FALSE suppresses the progress output.

mydfm <- dfm(corpus, remove=stopwords("english"), remove_punct=TRUE)

We can also remove rare terms. Because each word is a dimension in the vector representation of each document, trimming rare terms removes many dimensions from our data without losing much information. Setting verbose=TRUE prints some useful information about the number of features that were removed.

mydfm <- dfm_trim(mydfm, min_count = 5, min_docfreq = 3, verbose = TRUE)
## Removing features occurring:
##   - fewer than 5 times: 14,248
##   - in fewer than 3 documents: 12,707
##   Total features removed: 14,429 (69.9%).

6 Quick Overviews

Quanteda makes it easy to get quick overviews of your data. The first simple summary is the most frequently occurring words. This is often a good first check that something hasn’t gone horribly wrong. For example, when first preparing this, I forgot to remove punctuation and many pieces of punctuation were among the most frequent terms.

topfeatures(mydfm, 10)
##    social      data   effects    theory  analysis     model  economic 
##      2961      1686      1340      1216      1169      1069      1010 
## political     study  research 
##       990       982       976

A fun way of summarizing your text is with a word cloud. quanteda can do this for us as well; under the hood, it calls the wordcloud R package to do the magic.

set.seed(48109)
textplot_wordcloud(mydfm, min.freq = 6, random.order = FALSE,
                   rot.per = .25, max.words=50,
                   colors = RColorBrewer::brewer.pal(8,"Dark2"))

A tremendously useful technique is Key Words In Context (KWIC). This lets you check that your guess about how a word is being used is actually right.

kwic(corpus, "nuance*", window=3)
##                                                                      
##  [Nicholas Pedriana, Robin Stryker 2004, 156] authors contribute to |
##                       [Rick Grannis 2010, 48]          . Similarly, |
##          [Nicola Beisel, Tamara Kay 2004, 67]       develops a more |
##          [Iddo Tavory, Ann Swidler 2009, 182]            for a more |
##     [John Iceland, Kyle Anne Nelson 2008, 56]         obtain a more |
##      [Marc Dixon, Andrew W. Martin 2012, 184]        suggest a more |
##                  [Timothy J. Owens 1994, 154]         are employed, |
##                                         
##  nuanced | explanations of state        
##  nuances | in how we                    
##  nuanced | approach to intersectionality
##  nuanced | analysis of culture          
##  nuanced | view of spatial              
##  nuanced | picture of union             
##  nuances | are revealed that
kwic(corpus, phrase("causal inference"), window=3)
##                                                                    
##                                         [James Mahoney 1999, 13:14]
##  [Geoffrey T. Wodtke, David J. Harding, Felix Elwert 2011, 109:110]
##                                                                     
##  single strategy of | causal inference | . In fact                  
##    novel methods of | causal inference | for time-varying treatments

7 Comparisons by Group

Often in the social sciences we care about contrasts in language between one group and another. The following code sums over the relevant grouping variable.

group_dfm <- dfm(corpus, remove=stopwords("english"), remove_punct=TRUE,
             groups="journaltitle")
group_dfm[,1:10]
## Document-feature matrix of: 2 documents, 10 features (0% sparse).
## 2 x 10 sparse Matrix of class "dfm"
##                                features
## docs                            membership certain intergovernmental
##   American Journal of Sociology         40      47                 4
##   American Sociological Review          78      59                 3
##                                features
## docs                            organizations igos world trade
##   American Journal of Sociology           201   11   102    35
##   American Sociological Review            269    5   195    35
##                                features
## docs                            organization long argued
##   American Journal of Sociology          146   39     52
##   American Sociological Review           210   65     67

By adding comparison=TRUE we can now plot a word cloud that compares the groups in our dataset.

set.seed(48109)
textplot_wordcloud(group_dfm,max.words=50,
                   colors = RColorBrewer::brewer.pal(8,"Dark2"),
                   comparison=TRUE)

8 Simple Cluster Identification: K-means

We are going to analyze this corpus using the \(k\)-means algorithm. \(k\)-means is the most popular (and arguably important) flat clustering algorithm. Its objective is to minimize the average squared Euclidean distance of documents from their cluster centers where a cluster center is defined as the mean or centroid \(\overrightarrow{\mu}\) of the documents in a cluster \(\omega: \overrightarrow{\mu}(\omega) = \frac{1}{|\omega|}\sum_{\overrightarrow{x} \in \omega} \overrightarrow{x}\) (Manning, Raghavan and Schutze, 2008, 360).
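To make the objective concrete, here is a minimal self-contained sketch on toy data (all variable names here are our own) that recomputes the within-cluster sum of squares by hand and checks it against what kmeans() reports:

```r
set.seed(1)
# toy data: two well-separated 2-dimensional clusters of 10 points each
x <- rbind(matrix(rnorm(20, mean = 0), ncol = 2),
           matrix(rnorm(20, mean = 5), ncol = 2))
fit <- kmeans(x, centers = 2, nstart = 5)

# recompute the within-cluster sum of squares from the definition
wss <- sapply(1:2, function(k) {
  members  <- x[fit$cluster == k, , drop = FALSE]
  centroid <- colMeans(members)                # the mean vector mu(omega)
  sum(rowSums(sweep(members, 2, centroid)^2))  # squared distances to centroid
})
all.equal(sum(wss), fit$tot.withinss)          # should be TRUE
```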

The \(k\)-means algorithm uses a random initialization. In order to get the same results each time we will set the random seed. We will need to rerun this code immediately before executing the model if we want to guarantee the same results.

set.seed(12345)                            

Obviously we can try other seeds to see if we get different results.

The kmeans() function in R takes three relevant inputs: our data matrix, our k (the number of clusters we are looking for), and the number of starts. The number of starts initializes the algorithm with however many random starting points the user requests, and then the solution with the lowest within-cluster sum of squares is reported. If your computer is running slowly you can decrease the number of starts.

For the purposes of this exercise, and to make this run faster, we are going to really aggressively trim down the vocabulary.

mydfm <- dfm_trim(mydfm, min_count = 50, min_docfreq = 25, verbose = TRUE)
## Removing features occurring:
##   - fewer than 50 times: 4,987
##   - in fewer than 25 documents: 4,367
##   Total features removed: 4,995 (80.5%).

In addition to generating 10 clusters, let’s compare that to 5 clusters. Also, let’s see if subsequent runs of kmeans produce similar clusters.

Code runtime interlude: what’s k-means doing?

kmeans.results.10 <- kmeans(mydfm, centers = 10, nstart = 3) 
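The 5-cluster solution and a stability check can be run the same way. A sketch (the object names kmeans.results.5 and kmeans.results.10b are our own):

```r
# a 5-cluster solution for comparison
kmeans.results.5 <- kmeans(mydfm, centers = 5, nstart = 3)

# a second 10-cluster run; cluster numbers are arbitrary labels, so
# compare memberships across the two runs with a cross-tabulation
kmeans.results.10b <- kmeans(mydfm, centers = 10, nstart = 3)
table(first = kmeans.results.10$cluster, second = kmeans.results.10b$cluster)
```

If the runs are stable, most rows of the cross-tabulation will have one dominant cell.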

The cluster element of these results objects has a cluster number for each of the documents in our corpus. Of course, simply looking at this list gives us very little indication of what’s going on.

head(kmeans.results.10$cluster)
##                            Paul Ingram, Jeffrey Robinson, Marc L. Busch 2005 
##                                                                            9 
##                                                         Michael Hechter 2004 
##                                                                           10 
##                                     Don Tomaskovic‐Devey, Sheryl Skaggs 2002 
##                                                                            9 
##                                    Merril Silverstein, Vern L. Bengtson 1997 
##                                                                            9 
##                                                          Paul M. Hirsch 1997 
##                                                                            9 
## Susan Clampet-Lundquist, Kathryn Edin, Jeffrey R. Kling, Greg J. Duncan 2011 
##                                                                            9

8.1 Labeling Clusters: Distinguishing Words

Cluster labels (or topic names) can be generated automatically, but you should always manually read a sample of documents as well.

The guiding principle behind automatic labeling of clusters is to find the words that best characterize the cluster. Any automatic labeling strikes a balance between finding words that are representative of the cluster and words that distinguish the cluster from others. The words we identify should minimize the surprise when we read the documents after viewing the labels. The variance-weighted log-odds ratio developed by Monroe et al. in “Fighting Words” is a very useful tool for ranking the words within clusters in order to generate labels.

First, we need a function, clusterFightinWords, to calculate the cluster-specific variance-weighted log-odds of each term.

# dfm -- a quanteda document-term object
# clust.vect -- a boolean vector the same length of dfm, where TRUE indicates a member of the focal cluster
# alpha.0 -- the strength of the prior, expressed as number of terms per sub-corpus 
clusterFightinWords <- function(dfm, clust.vect, alpha.0=500) {
  # we need to get the overall corpus word distribution and the cluster-specific words dists
  # y_{kw} in Monroe et al. 
  overall.terms <- colSums(dfm)
  # n and n_k in Monroe et al. 
  n <- sum(overall.terms)
  # alpha_{kw} in Monroe et al. 
  prior.terms <- overall.terms / n * alpha.0
  # y_{kw}(i) in Monroe et al.
  cluster.terms <- colSums(dfm[clust.vect, ])
  # n_k(i) in Monroe et al.
  cluster.n <- sum(cluster.terms)
  
  cluster.term.odds <- 
    (cluster.terms + prior.terms) / 
      (cluster.n + alpha.0 - cluster.terms - prior.terms)
  overall.term.odds <- 
    (overall.terms + prior.terms) / 
      (n + alpha.0 - overall.terms - prior.terms)
  
  # usually, we'd hate to blindly log something, but as long as we have a non-zero prior, 
  # these will both be non-zero.  
  log.odds <- log(cluster.term.odds) - log(overall.term.odds)
  
  variance <- 1/(cluster.terms + prior.terms) + 1/(overall.terms + prior.terms)
  
  # return the variance weighted log-odds for each term
  return(log.odds / sqrt(variance))
}

We also need a function, clusterLabels, that calls clusterFightinWords for each cluster and packages the results into a tidy set of labels.

# dfm -- a quanteda document-term object
# results -- a k-means results object
# n.terms -- the number of terms to include in the label
clusterLabels <- function(dfm, results, n.terms=10) {
  clusters <- length(results$withinss)
  labels <- rep("", clusters)
  for (clust.num in 1:clusters) {   
    clust.vect <- results$cluster==clust.num
    terms <- clusterFightinWords(dfm, clust.vect)
    terms <- order(terms, decreasing = T)[1:n.terms]
    # use the terms as indices on the features list of the dfm
    labels[clust.num] <- paste(colnames(dfm)[terms], collapse=", ")
  }
  return(labels)
}

# In case we want to print tidy versions of the labels.  
# I don't use this here, but sprintf() is super useful to know.
printLabels <- function(labels, results){
  # note: seq_along() is needed here so we loop over every cluster
  for (clust.num in seq_along(labels)) {
    # sprintf() is a little wonky, but a super powerful tool for formatting your text output
    cat(sprintf("%2i) %5.0f Members | %s\n", 
                  clust.num, 
                  sum(results$cluster==clust.num),
                  labels[clust.num] 
          )
    )
  }
}

Now that we have functions defined to do the heavy lifting, it’s easier to compare our different clusterings. Note that the numbering of the clusters might be different even if the clusters themselves are very similar.

First, let’s look at our 10-cluster run:

  clusterLabels(dfm=mydfm, results=kmeans.results.10, 10)
##  [1] "mobility, occupational, intergenerational, class, origin, models, patterns, origins, career, industrial"        
##  [2] "labor, inequality, economic, market, income, growth, workers, countries, foreign, earnings"                     
##  [3] "women, men, gender, women's, marriage, female, sex, men's, married, gap"                                        
##  [4] "political, democracy, protest, party, state, elites, class, democratic, interests, business"                    
##  [5] "school, children, family, effects, educational, parental, parents, age, children's, marital"                    
##  [6] "status, task, characteristics, occupational, characteristic, belief, beliefs, prestige, attainment, interaction"
##  [7] "religious, religion, attendance, church, participation, nations, beliefs, involvement, membership, economies"   
##  [8] "racial, black, blacks, segregation, white, whites, residential, race, composition, areas"                       
##  [9] "organizations, organizational, sociology, theory, can, cultural, historical, law, model, article"               
## [10] "social, movement, network, movements, theory, networks, action, capital, ties, exchange"

Do these labels make sense to you? Try it again with different numbers of clusters.
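One way to try other numbers of clusters quickly is a small loop (a sketch; it refits kmeans on each pass, so it can take a moment):

```r
# label solutions for a few values of k side by side
for (k in c(5, 15)) {
  res <- kmeans(mydfm, centers = k, nstart = 3)
  cat("k =", k, "\n")
  print(clusterLabels(dfm = mydfm, results = res, n.terms = 5))
}
```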

8.1.1 Cluster-level word clouds

We can also get a high-level view of the cluster content by building a word cloud of just the documents in the cluster. Note that these clouds use the raw counts of words, rather than pulling out the words that most specifically apply to the cluster of interest. It won’t be surprising if words like “social” appear in most of the word clouds for these docs. Instead of showing which words set this cluster apart, as our fightin’ words labels do, the word clouds show how prevalent words are within the cluster, taken in isolation.

textplot_wordcloud(mydfm[kmeans.results.10$cluster==1,], max.words=30)

8.2 Reading Documents

Once we label something, we have a habit of letting that label stand in for the thing that it labels. Before we get too attached to any labeling of a set of documents, it’s important to confirm that what we infer from the label matches what we infer from the documents.

Let’s grab the cluster labels for our first 10-cluster solution and a few abstracts from each cluster and see how they compare.

labels <- clusterLabels(mydfm, kmeans.results.10, 10)
# add cluster numbers to the corpus documents
corpus$documents$cluster <- kmeans.results.10$cluster

Let’s make a convenience function for displaying a sample of docs from a given cluster.

printCorpusCluster <- function(corpus, clust.num, num.docs, labels) {  
    print(labels[clust.num])
    cluster.corpus <- corpus_subset(corpus, cluster==clust.num)
    # sample some docs from the cluster sub corpus
    cluster.docs <- texts(cluster.corpus)[sample(x=1:nrow(cluster.corpus$documents), num.docs)]
    return(cluster.docs)
}

And look at how these compare to some of the labels:

printCorpusCluster(corpus, 1, num.docs=3, labels)
## [1] "mobility, occupational, intergenerational, class, origin, models, patterns, origins, career, industrial"
##                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              Joel M. Podolny, James N. Baron 1997 
##                                                                                                                                                                                                                                                                                                                                            "We examine how the structure and content of individuals' networks in the workplace affect intraorganizational mobility. Consistent with prior research, we find that an individual's mobility is enhanced by having a large, sparse network of informal ties for acquiring information and resources. However, in contrast to previous work, we emphasize the importance of consistent role expectations for performance and mobility. We find evidence that well-defined performance expectations are more likely to arise from a small, dense network of individuals. We develop a typology of network contents and document the interaction between network structure and content in analyses of mobility among employees of a high-technology firm. We also examine how the effects of tie duration on mobility vary by tie content. We discuss the implications of our results for theory and research on networks and organizational mobility." 
##                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                             C. Matthew Snipp 1985 
## "This paper reviews several neo-Weberian ideas about social class for understanding patterns of men's career mobility. Special attention is devoted to the non-Marxist concepts of social class advocated by Giddens, Sorokin, and Parkin. Further insights are provided by an analysis of data for 20,000 men in the experienced American civilian labor force. The data are fit with a loglinear model and then further analyzed with multidimensional scaling techniques. The data analysis discloses three broad categories of social class that consist of farmers, manual, and nonmanual occupations. The findings also reveal further subdivisions in which professionals and skilled craft occupations form distinct classes within the nonmanual and manual categories. A third, much smaller, subdivision exists between semi-skilled operatives and unskilled laborers. These findings support neo-Weberian social class concepts. These results are also compared with Breiger's (1981) research on intergenerational mobility. This comparison shows that patterns in career mobility reveal a simpler picture of class structure than the eight class categories that Breiger reports for intergenerational mobility. The conceptual implications of these findings are discussed." 
##                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                Peter S. Bearman, Glenn Deane 1992 
##                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                "Models that allow the decomposition of mobility into its structural and exchange components are used to identify the structure, and consequences, of middle-class intergenerational mobility in preindustrial (1548-1689) Norwich, England. Dramatic shifts in the mobility opportunities of sons over time are seen to yield distinct political outcomes. Political stability is associated with almost universal upward mobility in the period from 1548 to 1589, while from 1590 to 1639 structural processes leading to massive downward mobility are associated with increased radicalization and participation in the English civil war. From 1640 to 1689, strata persistence in an unstable political context prevailed."

8.3 Visualizing Clusters with Covariates of Interest

We might want to see how the two journals compare in terms of the clusters of documents.

  corpus$documents$cluster <- kmeans.results.10$cluster
  labels <- clusterLabels(mydfm, kmeans.results.10, 3)
  # package the labels into a lookup dataframe so it can be merged into the corpus and used in the graph
  label.df <- data.frame(cluster=1:length(labels), cluster.name=labels)
  corpus$documents %>%
    left_join(label.df) %>%
    ggplot() +
      geom_bar(aes(x=cluster.name, fill=journaltitle), position="dodge", width=0.75) +
      xlab("Cluster") +
      ylab("Number of Articles") +
      # the labels are loooonnnngggg, so rotate them 45 degrees
      theme(axis.text.x=element_text(angle=45, hjust=1))
## Joining, by = "cluster"

9 Topic Models

This section introduces the Structural Topic Model (stm) package. Most of it is drawn directly from the package’s vignette, which goes through these and other features in even greater depth. I’ll also show how to fit a standard LDA topic model using the Mallet package’s Gibbs sampling algorithm.

We’re going to be working with Eisenstein and Xing’s (2010) corpus of political blog posts from 2008. This will give us two covariates of interest: the liberal/conservative rating and the day during 2008 the blog was published.

9.1 Preprocessing Data

We start by loading stm and the dataset. The textProcessor() function will use the tm package’s text processing tools to stem and remove stop words in the documents. It will return the documents divided into the documents, vocab, and meta formats that we saw working with the poliblog5k sample. Check ?textProcessor for options, as it allows you to change stemming options, alter the language used for stop word identification, or even pass in your own custom stop word list.

library(stm)
library(data.table)
poliblog <- as.data.frame(data.table::fread("poliblogs2008.csv"))
# default: removes stop words in english, stems words, removes words under 3 characters, puts everything to lower case
processed.docs <- textProcessor(poliblog$documents, metadata = poliblog)
## Building corpus... 
## Converting to Lower Case... 
## Removing punctuation... 
## Removing stopwords... 
## Removing numbers... 
## Stemming... 
## Creating Output...
str(processed.docs, list.len=10)
## List of 4
##  $ documents   :List of 13246
##   ..$ 1    : int [1:2, 1:127] 10333 1 11536 1 11726 1 13018 1 13228 1 ...
##   ..$ 2    : int [1:2, 1:189] 11860 1 12037 1 13139 1 13280 1 13768 1 ...
##   ..$ 3    : int [1:2, 1:121] 10364 2 11139 1 12814 1 13228 1 13280 2 ...
##   ..$ 4    : int [1:2, 1:135] 9815 1 10121 1 10234 1 10585 3 11703 1 ...
##   ..$ 5    : int [1:2, 1:113] 10168 1 10513 1 11149 1 12500 1 13254 1 ...
##   ..$ 6    : int [1:2, 1:203] 397 1 10513 1 10675 1 11248 1 11373 1 ...
##   ..$ 7    : int [1:2, 1:172] 11536 1 11860 1 13114 1 13177 1 13768 2 ...
##   ..$ 8    : int [1:2, 1:236] 187 1 10008 1 10207 1 11312 1 11821 1 ...
##   ..$ 9    : int [1:2, 1:159] 3184 1 10364 1 10932 1 11749 2 12946 2 ...
##   ..$ 10   : int [1:2, 1:176] 10802 1 11821 1 13095 1 13228 1 13567 4 ...
##   .. [list output truncated]
##  $ vocab       : chr [1:123990] "---" "---backyard" "---bootstrap" "---box" ...
##  $ meta        :'data.frame':    13246 obs. of  6 variables:
##   ..$ V1       : chr [1:13246] "1" "2" "3" "4" ...
##   ..$ documents: chr [1:13246] "After a week of false statements, lies, and dismissive apologies, Pakistani President Pervez Musharraf now says"| __truncated__ "I honestly don't know how either party's caucus results will play out tonight. Usually, you can get a good idea"| __truncated__ "While we stand in awe of the willingness of our troops in Iraq to sacrifice themselves for the Nation, the fewe"| __truncated__ "These pages recently said goodbye to global warming.  Ironically, the current spell of global warming, such as "| __truncated__ ...
##   ..$ docname  : chr [1:13246] "at0800300_1.text" "at0800300_2.text" "at0800300_3.text" "at0800300_4.text" ...
##   ..$ rating   : chr [1:13246] "Conservative" "Conservative" "Conservative" "Conservative" ...
##   ..$ day      : int [1:13246] 3 3 3 3 3 3 4 4 4 4 ...
##   ..$ blog     : chr [1:13246] "at" "at" "at" "at" ...
##  $ docs.removed: int(0) 
##  - attr(*, "class")= chr "textProcessor"
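The defaults above are sensible, but they are choices, not requirements. As a sketch, here is a call that skips stemming and supplies a custom stop word list (the words chosen here are purely illustrative):

```r
# illustrative only: keep full word forms and drop a few extra common words
processed.nostem <- textProcessor(poliblog$documents,
                                  metadata = poliblog,
                                  stem = FALSE,
                                  customstopwords = c("said", "will", "one"))
```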

As we saw above, we often want to remove the rarer words from our corpus so the algorithm can focus on words that occur across many different documents. The prepDocuments() function will do that for us. The plotRemoved() function lets us preview how many documents, words, and tokens we would remove at various threshold levels (the minimum number of documents a word must appear in).

# try lower thresholds from 1 up to 130 documents (roughly 1% of the corpus)
plotRemoved(processed.docs$documents, lower.thresh = seq(1, 130, by = 5))

It looks like most of the words this procedure trims are already removed by a threshold of around 20 documents. We’ll go with that.

prepDocuments will handle reindexing the vocabulary and the metadata as words are removed. This is a royal pain to do by hand; shoulders of giants here.

prepped.docs <- prepDocuments(processed.docs$documents, 
                     processed.docs$vocab, 
                     processed.docs$meta,
                     lower.thresh = 20)
## Removing 115708 of 123990 terms (246536 of 2298953 tokens) due to frequency 
## Your corpus now has 13246 documents, 8282 terms and 2052417 tokens.
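If you want to sanity-check a threshold yourself, the document frequencies behind plotRemoved() are easy to recover by hand: each element of the documents list is a 2-row matrix whose first row holds vocabulary indices. A sketch:

```r
# per-term document frequency, computed directly from the processed documents
doc.freq <- table(unlist(lapply(processed.docs$documents, function(d) d[1, ])))
# terms appearing in 20 or fewer documents -- what lower.thresh = 20 drops
sum(doc.freq <= 20)
```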

9.2 Interlude: Know Your (Meta) Data

Let’s take a quick look at the metadata in our dataset: the blog from which the post comes, and the day the post was made.

# Which blogs are which?
table(prepped.docs$meta$blog, prepped.docs$meta$rating)
##      
##       Conservative Liberal
##   at          3197       0
##   db             0    1879
##   ha          3708       0
##   mm           677       0
##   tp             0    2080
##   tpm            0    1705
# Conservatives and Liberals over time (day binned into ~months)
table(prepped.docs$meta$rating, round(prepped.docs$meta$day/30, 0))
##               
##                  0   1   2   3   4   5   6   7   8   9  10  11  12
##   Conservative 252 418 576 645 737 625 643 662 771 667 638 547 401
##   Liberal      187 392 374 288 333 424 477 493 589 693 644 463 307
# Blogs over time
table(prepped.docs$meta$blog, round(prepped.docs$meta$day/30, 0))
##      
##         0   1   2   3   4   5   6   7   8   9  10  11  12
##   at  120 208 266 237 294 267 264 262 294 278 283 238 186
##   db   78 118 103 110 131 114 149 155 206 235 216 156 108
##   ha   98 146 268 359 352 299 319 357 404 333 310 267 196
##   mm   34  64  42  49  91  59  60  43  73  56  45  42  19
##   tp   76 163 176  84 100 181 196 188 215 224 212 160 105
##   tpm  33 111  95  94 102 129 132 150 168 234 216 147  94
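Raw counts can hide shifts in composition, so it is often worth viewing the same cross-tab as proportions. For example, each month’s conservative/liberal split:

```r
# column proportions: within each ~month, the share of conservative vs. liberal posts
round(prop.table(table(prepped.docs$meta$rating,
                       round(prepped.docs$meta$day/30, 0)),
                 margin = 2), 2)
```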

9.3 Fitting a Structural Topic Model

The stm function does the real heavy lifting, fitting a topic model to our data. There are more advanced settings for altering the Bayesian priors or the initialization algorithms used, but I’m going to focus on just a handful of features in this lab.

K sets the number of topics. As we’ll discuss in lecture, and as covered in the readings, choosing this is an important part of choosing a model.

Several of the options are worth discussing here:

  1. The overall topic prevalence (the expected proportion of a given document devoted to each topic) can be modeled as a function of covariates of the document. Here, we use the standard R model specification idiom to make it dependent on the rating (partisanship) of the blog author and a spline function of the day, s(day), which will allow the overall topic prevalence to change over the course of 2008.

  2. The content =~ term would allow us to set a single categorical variable over which the word distribution of topics can be allowed to vary. In this case, we might allow each of the six different authors to speak about the same topic using slightly different distributions of words. We’re going to skip over this feature for this lab, but you can find out more about it in the various STM papers by Roberts et al. If you do explore it, keep in mind that it can have very large impacts on your model and requires that each category for which you want to allow separate content estimates contain a fairly large number of documents.

  3. max.em.its tells stm the maximum number of iterations to use in the EM algorithm before giving up on it ever converging.

  4. init.type=“Spectral” tells stm to initialize the EM algorithm using the spectral decomposition of the term co-occurrence matrix. The spectral initialization uses a deterministic algorithm to find the best starting point for STM’s main EM algorithm. In many of the computational and human-judged ways that we’d want to compare models, it tends to produce better models than a random initialization.

poliblog.fit <- 
  stm(prepped.docs$documents, 
      prepped.docs$vocab, 
      K = 20, # number of topics
      prevalence =~ rating + s(day),  
      max.em.its = 75,
      data = prepped.docs$meta, 
      init.type = "Spectral")
save(poliblog.fit, file="poliblog_fit.Rdata")
load("poliblog_fit.Rdata")
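Since fitting can take a while, the save()/load() pair above is often wrapped in a simple caching check so the model is only refit when no saved copy exists. A sketch, using the same file name and stm() call as above:

```r
# only refit if a saved model isn't already on disk
if (file.exists("poliblog_fit.Rdata")) {
  load("poliblog_fit.Rdata")
} else {
  poliblog.fit <- stm(prepped.docs$documents,
                      prepped.docs$vocab,
                      K = 20,
                      prevalence =~ rating + s(day),
                      max.em.its = 75,
                      data = prepped.docs$meta,
                      init.type = "Spectral")
  save(poliblog.fit, file = "poliblog_fit.Rdata")
}
```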

9.4 Model Overview

The first step in exploring the topic model result is to see what’s going on in the topics so that we can start to get a sense of what features of the corpus they might reflect.

Calling plot.STM() on the fit object will show us the topics and their relative shares at a glance.

plot.STM(poliblog.fit)
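By default the summary plot labels each topic with its highest-probability words; if those feel too generic, you can request FREX labels instead. A sketch:

```r
# same topic-share plot, but labeled with FREX words rather than raw frequency
plot.STM(poliblog.fit, type = "summary", labeltype = "frex", n = 5)
```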

9.4.1 Content of Topics

Of course, these three-word labels don’t give us much insight into the model. Let’s dig into two of the topics to see what’s going on.

The labelTopics command will give us the top 7 words (by default) of each topic, using four different metrics for ranking the top words. If we’d included a content covariate, then we’d need to call sageLabels(poliblog.fit) to get the same information; it would average across the content groups to give both overall word lists for each topic and a word list for each content covariate category within each topic.

Prob ranks words by their raw probability within the topic; FREX takes the harmonic mean of each word’s frequency and exclusivity ranks. Score is taken from the lda package, and lift comes out of the multinomial inverse regression we looked at last week.

Personally, with this model (on this corpus), I find the FREX (frequency/exclusivity) lists to be the most useful. When we look at LDAVis below, this is similar to what the lambda slider does to the word lists.

labelTopics(poliblog.fit)
## Topic 1 Top Words:
##       Highest Prob: obama, campaign, barack, mccain, biden, will, senat 
##       FREX: biden, barack, obama, joe, debat, campaign, team 
##       Lift: oct, goolsbe, biden, celeb, ifil, wurzelbach, obamabiden 
##       Score: oct, obama, barack, biden, campaign, mccain, ayer 
## Topic 2 Top Words:
##       Highest Prob: obama, mccain, poll, voter, democrat, state, lead 
##       FREX: poll, pennsylvania, virginia, percent, margin, elector, voter 
##       Lift: composit, zogbi, quinnipiac, gallup, todaygallup, outlier, outspend 
##       Score: composit, poll, mccain, obama, voter, rasmussen, percent 
## Topic 3 Top Words:
##       Highest Prob: one, time, get, stori, question, doesnt, media 
##       FREX: didnt, doesnt, reader, cant, isnt, blogger, blog 
##       Lift: bookshelf, livechat, registeredw, handychristma, ustream, watchupd, readership 
##       Score: registeredw, doesnt, didnt, media, exit, drudg, reader 
## Topic 4 Top Words:
##       Highest Prob: democrat, senat, republican, bill, vote, hous, legisl 
##       FREX: legisl, pelosi, reid, lieberman, bipartisan, nanci, speaker 
##       Lift: hoyer, legisl, steni, clotur, reid, pelosi, dcalif 
##       Score: legisl, republican, senat, vote, pelosi, reid, lieberman 
## Topic 5 Top Words:
##       Highest Prob: american, america, peopl, polit, will, countri, can 
##       FREX: wright, black, racist, pastor, reagan, liber, church 
##       Lift: saul, negro, reverend, perlstein, pfleger, farrakhan, sermon 
##       Score: saul, wright, church, black, jeremiah, pastor, racial 
## Topic 6 Top Words:
##       Highest Prob: elect, vote, voter, republican, state, ballot, democrat 
##       FREX: franken, ballot, coleman, registr, chambliss, minnesota, seat 
##       Lift: hagan, nrsc, chambliss, elia, gopheld, saxbi, wicker 
##       Score: elia, franken, coleman, ballot, registr, voter, vote 
## Topic 7 Top Words:
##       Highest Prob: one, will, film, allah, gay, christian, book 
##       FREX: film, allah, muhammad, gay, marriag, qur, hage 
##       Lift: aljalalayn, hartley, ibn, qur, yusuf, filmmak, kathir 
##       Score: meccan, film, allah, qur, muhammad, sura, ibn 
## Topic 8 Top Words:
##       Highest Prob: group, million, money, organ, chicago, school, public 
##       FREX: chicago, donat, school, rezko, student, ayer, donor 
##       Lift: voucher, daley, annenberg, dohrn, rezko, bernadin, bernardin 
##       Score: voucher, ayer, chicago, rezko, school, acorn, donat 
## Topic 9 Top Words:
##       Highest Prob: oil, energi, price, will, drill, gas, product 
##       FREX: drill, oil, energi, mugab, china, gas, price 
##       Lift: refineri, tsvangirai, mugab, thabo, zimbabw, mdc, opec 
##       Score: mbeki, oil, mugab, drill, energi, price, gas 
## Topic 10 Top Words:
##       Highest Prob: iran, israel, nuclear, terrorist, state, iranian, terror 
##       FREX: hama, palestinian, israel, isra, iran, iranian, nuclear 
##       Lift: damascus, fatah, iaea, lebanes, assad, bashar, hama 
##       Score: fatah, iran, israel, hama, iranian, isra, palestinian 
## Topic 11 Top Words:
##       Highest Prob: global, warm, abort, chang, studi, research, climat 
##       FREX: warm, abort, climat, scientist, scienc, scientif, birth 
##       Lift: dioxid, embryon, ipcc, alarmist, scientif, anthropogen, scientist 
##       Score: embryon, abort, global, warm, climat, emiss, scienc 
## Topic 12 Top Words:
##       Highest Prob: iraq, war, iraqi, militari, troop, will, forc 
##       FREX: iraqi, iraq, maliki, petraeus, sadr, troop, baghdad 
##       Lift: almaliki, alsadr, basra, centcom, drawdown, fallon, nouri 
##       Score: fallon, iraq, iraqi, troop, maliki, sadr, petraeus 
## Topic 13 Top Words:
##       Highest Prob: clinton, hillari, will, democrat, primari, campaign, parti 
##       FREX: hillari, clinton, romney, huckabe, deleg, primari, edward 
##       Lift: superdel, bylaw, superdeleg, wolfson, rodham, frontrunn, huckabe 
##       Score: bylaw, hillari, clinton, romney, deleg, huckabe, superdeleg 
## Topic 14 Top Words:
##       Highest Prob: law, court, investig, tortur, case, state, depart 
##       FREX: tortur, blagojevich, interrog, detaine, attorney, court, justic 
##       Lift: habea, antonin, mukasey, siegelman, addington, yoo, guantánamo 
##       Score: addington, blagojevich, attorney, tortur, detaine, court, interrog 
## Topic 15 Top Words:
##       Highest Prob: govern, will, attack, russia, pakistan, georgia, russian 
##       FREX: russian, musharraf, pakistan, pakistani, taliban, russia, nato 
##       Lift: benazir, bhutto, isi, islamabad, ossetia, putin, tbilisi 
##       Score: gori, pakistan, russian, russia, taliban, musharraf, pakistani 
## Topic 16 Top Words:
##       Highest Prob: think, like, know, say, peopl, just, thing 
##       FREX: linktocommentspostcount, postcounttb, matthew, guy, digbi, oreilli, think 
##       Lift: somerbi, tina, digbyim, tweeti, gasbag, dooci, macho 
##       Score: tina, linktocommentspostcount, postcounttb, matthew, oreilli, think, digbi 
## Topic 17 Top Words:
##       Highest Prob: mccain, palin, john, said, sen, campaign, sarah 
##       FREX: palin, sarah, raz, mccain, fox, alaska, sen 
##       Lift: holtzeakin, pfotenhau, whiner, scriptalreadyrequest, acflruncont, allowfullscreentru, allowscriptaccessalway 
##       Score: whiner, mccain, palin, raz, sarah, sen, alaska 
## Topic 18 Top Words:
##       Highest Prob: said, citi, polic, report, immigr, offic, kill 
##       FREX: polic, immigr, taser, incid, flight, san, protest 
##       Lift: taser, berkeley, deport, capt, vandal, passeng, escort 
##       Score: taser, polic, immigr, citi, murder, spitzer, arrest 
## Topic 19 Top Words:
##       Highest Prob: tax, will, economi, econom, govern, plan, american 
##       FREX: tax, mortgag, financi, fanni, billion, economi, loan 
##       Lift: homeownership, mortgageback, lender, efca, fdic, gses, mae 
##       Score: efca, tax, mortgag, bailout, billion, economi, loan 
## Topic 20 Top Words:
##       Highest Prob: bush, presid, administr, hous, white, said, polici 
##       FREX: bush, cheney, presid, administr, secretari, georg, white 
##       Lift: methodist, perino, podesta, mcclellan, cheney, powel, bush 
##       Score: methodist, bush, presid, administr, cheney, perino, white
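labelTopics() also accepts topics and n arguments, so you can pull longer word lists for just the topics you plan to examine, e.g.:

```r
# longer word lists for the three topics whose documents we'll read below
labelTopics(poliblog.fit, topics = c(1, 3, 11), n = 10)
```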

9.4.2 Reading Representative Documents

STM will also help us call up representative documents to read to make sure our topics mean what we think they mean. The findThoughts() function will pull up the top n documents associated with each topic.

# the readable document text is buried in the $meta object; coerce to
# character in case it was read in as a factor
doc.text <- as.character(prepped.docs$meta$documents)
sample.docs <- findThoughts(poliblog.fit, 
                            texts = doc.text, 
                            n = 3, 
                            topics = c(1,3,11))
# the first topic we grabbed: 1, Obama
sample.docs$docs[[1]]
## [1] "At a press conference just now, Barack Obama rejected John McCain's demand for a suspension of the debate.  \"\"I believe we should continue to have the debate,\"\" he just said. \"\"I believe it makes sense for us to present ourselves to the American people.\"\"  \"\"Obviously if it turns out that we need to be in Washington, we've both got big planes, we've painted our slogan on the side of them,\"\" Obama also said. \"\"They can get us from Washington to Mississippi pretty quickly.\"\" The debate is set to take place in Mississippi.  Obama also said that he was blindsided by McCain's public call for a debates suspension. After describing their conversation about a possible suspension, Obama said: \"\"I thought that this was something that he was mulling over. Apparently this was something that he was more decisive about in his own mind.\"\"  Obama described their conversation as follows: \"\"I proposed putting out the joint statement. He concurred with that. he then also said, 'I would like us to look at suspending the campaign and pushing the debates off.' I said, 'let's put out the joint statement first, and then get our campaigns to discuss this.'\"\" Obama said he later saw McCain announcing his plans on television.  One more time: If this version of events is true, McCain's public call for a suspension was anything but apolitical. If McCain had truly intended to keep this apolitical, he would have asked Obama to jointly suspend the debates, made his own full intentions clear, and waited for Obama's private and definitive answer before going public."
## [2] "Hillary is already out with a statement hitting Obama over  the Associated Press story reporting that a Canadian official said in a memo that Obama economic adviser Antan Goolsbee privately suggested that Obama's NAFTA stump talk was \"\"more about political positioning than a clear articulation of policy plans.\"\"   Here's her statement...  I think that after days of denial, the Obama campaign was confronted with a memo of a meeting -- it was my understanding -- in which there was a discussion of NAFTA. And it  raises questions about Senator Obama coming to Ohio and giving  speeches about NAFTA and having his chief economic advisor tell the Canadian  government that it was just political rhetoric.   I donâ\u0080\u0099t think  people should come to Ohio and tell the people of Ohio one thing and  then have your campaign tell a foreign government something else behind closed  doors.  Goolsbee has disputed the memo's characterization of the conversation, saying: \"\"In no possible way was I inferring that he was going to introduce any policies that you should ignore and he had no intention of enacting.\"\"  It's unclear how this latest turn in the whole saga will play in Ohio, where NAFTA is a highly-charged issue, or whether it's coming too late to move votes there."                                                                                                                                                                                                                                                                                                         
## [3] "The McCain campaign hits Obama over Wes Clark's comments for the third day running, releasing yet another statement...  \"\"Yesterday, Barack Obama's campaign said he rejected Gen. Clark's attack on John McCain's military service. But last night, Gen. Clark admitted to speaking with the Obama campaign, and then went out and repeated his attacks. It's clear that the Obama campaign isn't telling Wes Clark to apologize, and are either encouraging or tolerating his attacks on John McCain's military service.   The Obama campaign even said they were 'glad' that Gen. Clark 'clarified' a comment they supposedly repudiated. If this kind of wink-and-nod game is how Barack Obama wants to run his campaign, then fine. But spare us the empty talk of 'new politics' and raising the dialogue in this country. We just wonder: Will Barack Obama's actions ever match his words?\"\"   It's really unclear what exactly the McCain campaign is pretending to be outraged about. Whatever the wisdom of Obama's repudiation of Clark's comments, this is pretty simple: Obama doesn't stand by Clark's comments, and Clark does. Clark communicated as much to the Obama campaign.  How does this prove that Obama is \"\"encouraging or tolerating\"\" the attacks? It doesn't, of course. At this point McCain advisers are simply manufacturing outrage, secure in the knowledge that they won't get called out for it. Oh, and by the way, Clark didn't \"\"attack\"\" McCain's military service."
# the second topic: 3, meta-media
sample.docs$docs[[2]]
## [1] "The Northern Alliance Radio Network will be on the air today, with our eight-hour-long broadcast schedule starting at 9 am CT.  If you’re in the Twin Cities, you can hear us on AM 1280 The Patriot, or on the station’s Internet stream if you’re outside of the broadcast area.Today, I’m traveling to the Left Coast for the holidays — and to see Notre Dame play USC at the Colisseum — but Mitch will have all the thrills and chills of our usual show.  We won’t have a live video stream, but the chat room will be up & operational, as long as Ustream doesn’t flop like it did yesterday.Be sure to call 651-289-4488 to join the conversation!Mitch and I now live-stream our show from 1-3 pm CT on Ustream, complete with live chat!  Be sure to register for free at Ustream to participate in our raucous live-chat sessions.    (And if the log-in prompt doesn’t come up in the chat box below, use this link instead.)Want to see all episodes of The Ed Morrissey Show?  Add the RSS feed to your reader and keep them handy!Christmas season is coming!  Why not shop through my Amazon widgets?  Get great deals delivered to your door.  Check out Ed’s Bookshelf or search for what you want:<!--</p> <p>// --><!--</p> <p>// -->"         
## [2] "Today, on the Ed Morrissey Show (3 pm ET), Duane “Generalissimo” Patterson of the Hugh Hewitt Show joins us for our Week in Review.  However, I suspect most of our time will be spent on analyzing last night’s debate between Sarah Palin and Joe Biden.  We’ll save some time to preview Hugh’s show tonight, after his marathon 6-hour broadcast last night.Special note: Today is my 5-year blogiversary!  I started Captain’s Quarters on October 2nd, 2003, and I’ve blogged constantly since.  I can’t think of a better way to celebrate it than with my friends at TEMS.  Big thanks to Jim Lynch, who first reminded me of it, and to Sister Toldjah, who is also celebrating Blogiversary #5 today.Now you can join the conversation in the chat room!  Be sure to register at Ustream to participate in our raucous live-chat sessions.    (And if the log-in prompt doesn’t come up in the chat box below, use this link instead.)  Jazz Shaw of The Moderate Voice moderates the chat, and he has his Troll Gun Trivia Contest running before the show, for those who get there early — and are registered!Want to see all episodes of The Ed Morrissey Show?  Add the RSS feed to your reader and keep them handy!Opening animation from TG Studios."
## [3] "Today, on the Ed Morrissey Show (3 pm ET), we’re thrilled to welcome back Mary Katharine Ham.  MK and I will talk about the latest developments in the Blago-Rahma scandal, the Joe Scarborough takedown of the national media, and much more.  In the first half of the show, we will talk with Lila Rose, the young woman who has gone undercover to expose the practices of Planned Parenthood in a series of videos.  Don’t miss this!!Update: This was one of my favorite shows.  Be sure to watch it on replay below!Now you can join the conversation in the chat room!  Be sure to register at Ustream to participate in our raucous live-chat sessions.    (And if the log-in prompt doesn’t come up in the chat box below, use this link instead.)  Jazz Shaw of The Moderate Voice moderates the chat, and he has his Troll Gun Trivia Contest running before the show, for those who get there early — and are registered!Want to see all episodes of The Ed Morrissey Show?  Add the RSS feed to your reader and keep them handy!Christmas season is coming!  Why not shop through my Amazon widgets?  Get great deals delivered to your door.  Check out Ed’s Bookshelf or search for what you want:  Opening animation from TG Studios."
# and the third topic we grabbed: 11, climate change
sample.docs$docs[[3]]
## [1] "Is the scientific debate over on global warming?  Not according to the American Physical Society* in this year's July's issue of Physics and Society .\"\"With this issue of Physics & Society, we kick off a debate concerning one of the main conclusions of the International Panel on Climate Change (IPCC), the UN body which, together with Al Gore, recently won the Nobel Prize for its work concerning climate change research. There is a considerable presence within the scientific community of people who do not agree with the IPCC conclusion that anthropogenic CO2 emissions are very probably likely to be primarily responsible for the global warming that has occurred since the Industrial Revolution.  Since the correctness or fallacy of that conclusion has immense implications for public policy and for the future of the biosphere, we thought it appropriate to present a debate within the pages of P&S concerning that conclusion.  This editor invited several people to contribute articles that were either pro or con.  Christopher Monckton responded ...\"\"  [Emphasis added.]And what did Lord Monckton say?\"\"Some reasons why the IPCC's estimates may be excessive and unsafe are explained.  More importantly, the conclusion is that, perhaps, there is no \"\"climate crisis\"\", and that currently-fashionable efforts by governments to reduce anthropogenic CO2 emissions are pointless, may be ill-conceived, and could even be harmful.\"\"He examined specific assumptions of the IPCC cited computer models and found that, even using the same models but with more justifiable assumptions, carbon dioxide is not a critical threat to global temperatures.\"\"Theoretically, empirically, and in the literature that we have extensively cited, each of the values we have chosen as our central estimate is arguably more justifiable - and is certainly no less justifiable - than the substantially higher value selected by the IPCC. 
Accordingly, it is very likely that in response to a doubling of pre-industrial carbon dioxide concentration TS will rise not by the 3.26 °K suggested by the IPCC, but by <1 °K.\"\"He concluded with \"\"If the concluding equation in this analysis is correct, the IPCC's estimates of climate sensitivity must have been very much exaggerated.  There may, therefore, be a good reason why, contrary to the projections of the models on which the IPCC relies, temperatures have not risen for a decade and have been falling since the phase-transition in global temperature trends that occurred in late 2001.  Perhaps real-world climate sensitivity is very much below the IPCC's estimates. Perhaps, therefore, there is no \"\"climate crisis\"\" at all. At present, then, in policy terms there is no case for doing anything. The correct policy approach to a non-problem is to have the courage to do nothing.\"\"  [Emphasis added.]*According to Wikipedia http://en.wikipedia.org/wiki/American_Physical_Society, the American Physical Society was founded in 1899 and is the second largest association of physicists in the world, with over 40,000 members. Update: According to http://www.aps.org/APS Climate Change StatementAPS Position Remains UnchangedThe American Physical Society reaffirms the following position on climate change, adopted by its governing body, the APS Council, on November 18, 2007:\"\"Emissions of greenhouse gases from human activities are changing the atmosphere in ways that affect the Earth's climate.\"\"An article at odds with this statement recently appeared in an online newsletter of the APS Forum on Physics and Society, one of 39 units of APS.  The header of this newsletter carries the statement that \"\"Opinions expressed are those of the authors alone and do not necessarily reflect the views of the APS or of the Forum.\"\"  This newsletter is not a journal of the APS and it is not peer reviewed.  Read: APS Climate Change Statement"
## [2] "Climate change report forecasts global sea levels to rise up to 4 feet by 2100.  \t\t\t\t\tAccording to a new report led by the U.S. Geological Survey, the U.S. “faces the possibility of much more rapid climate change by the end of the century than previous studies have suggested.” The report, commissioned by the U.S. Climate Change Science Program, found that global sea levels could rise higher than a 2007 U.N. Intergovernment Panel on Climate Change (IPCC) study had concluded:  In one of the report’s most worrisome findings, the agency estimates that in light of recent ice sheet melting, global sea levels could rise as much as 4 feet by 2100. The intergovernment panel had projected a rise of no more than 1.5 feet by that time, but satellite data over the last two years show the world’s major ice sheets are melting much more rapidly than previously thought. The Antarctic and Greenland ice sheets are losing an average of 48 cubic miles of ice a year, equivalent to twice the amount of ice in the Alps. The lead scientist for the report’s chapter on ice sheets said the models used by the IPCC “did not factor in some of the dynamics that scientists now understand about ice sheet melting” such as “a process of ‘lubrication,’ in which warmer ocean water gets underneath coastal ice sheets and accelerates melting.”           \t  Update The Wonk Room's Brad Johnson has more on ice sheet melting."                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                       
## [3] "Deathly news for the religion of Global Warming. Looks like at least one prominent scientific group has changed its mind about the irrefutability of evidence regarding man made climate change.The American Physical Society representing nearly 50,000 physicists \"\"has reversed its stance on climate change and is now proclaiming that many of its members disbelieve in human-induced global warming,\"\" according to an article in The Daily Tech. The leadership of the society had previously referred to global warming evidence as \"\"incontravertible.\"\"What does this mean? It doesn't mean that the APS has abandoned its position on man made climate change. What they have done is perhaps more significant; they have opened the subject up for debate once again. And once the facts can be judged on their merits and not on who is submitting them, at the very least global warming deniers have a powerful platform to present their findings.And the first shot across the bow is directed at the Nobel Prize winning group, the IPPC which one scientist discovered has been fudging its studies to hide the fact that increased C02 does not necessarily lead to increased temps:The APS is opening its debate with the publication of a paper by Lord Monckton of Brenchley, which concludes that climate sensitivity -- the rate of temperature change a given amount of greenhouse gas will cause -- has been grossly overstated by IPCC modeling.   
A low sensitivity implies additional atmospheric CO2 will have little effect on global climate.Larry Gould, Professor of Physics at the University of Hartford and Chairman of the New England Section of the APS, called Monckton's paper an \"\"expose of the IPCC that details numerous exaggerations and \"\"extensive errors\"\"In an email to DailyTech, Monckton says, \"\"I was dismayed to discover that the IPCC's 2001 and 2007 reports did not devote chapters to the central 'climate sensitivity' question, and did not explain in proper, systematic detail the methods by which they evaluated it. When I began to investigate, it seemed that the IPCC was deliberately concealing and obscuring its method.\"\"  According to Monckton, there is substantial support for his results, \"\"in the peer-reviewed literature, most articles on climate sensitivity conclude, as I have done, that climate sensitivity must be harmlessly low.\"\"Monkton attributes temperature rise to \"\"natural variability\"\" which has been a key argument of the deniers for years. This is only one scientific organization but an important one. And while such news is unlikely to reverse the political momentum toward draconian changes, it may slow the march of science toward assisting the charlatans like Al Gore in their efforts.If so, it is big news indeed."

9.4.3 Wordclouds

As before, we can also use word clouds to get snapshot overviews of our topics of interest.

cloud(poliblog.fit, topic=1) 

cloud(poliblog.fit, topic=3)
## Warning in wordcloud::wordcloud(words = vocab, freq = vec, max.words =
## max.words, : doesnt could not be fit on page. It will not be plotted.

cloud(poliblog.fit, topic=17)
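The warning above comes from wordcloud running out of room on the plotting device. A sketch of one way to tame a crowded cloud: cap the number of words and shrink the size range (max.words is an argument to stm's cloud(); scale is passed through to wordcloud). The exact values here are illustrative, not from the original lab.

```r
# fewer words and a narrower size range usually avoid the
# "could not be fit on page" warnings
cloud(poliblog.fit, topic = 3, max.words = 50, scale = c(3, 0.5))
```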

9.4.4 Comparing Topic Vocabularies

It can also be illustrative to directly compare topics of interest in terms of their vocabulary, to make sure that what we think are the distinguishing characteristics of each actually do distinguish them. plot.STM() has a plot type=“perspectives” that does exactly this.

plot.STM(poliblog.fit, type="perspectives", topics=c(3,11))

Those two are pretty uninteresting because of how much the two candidates dominate them. How about topic 20: bush-presid-admin and topic 4: democrat-senate-republican?

plot.STM(poliblog.fit, type="perspectives", topics=c(4,20))

If we had run our model with a content covariate, then calling plot.STM(…, type=“perspectives”) with a single topic number would have let us see how different context categories talk about that topic differently.
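As a sketch of what that would look like: content.fit below is a hypothetical model (not fit in this lab) that includes a content covariate, and the single-topic perspectives call then contrasts how the two rating categories talk about that topic.

```r
# hypothetical: a model fit with a content covariate, e.g.
# content.fit <- stm(prepped.docs$documents, prepped.docs$vocab, K = 20,
#                    prevalence = ~rating + s(day), content = ~rating,
#                    data = prepped.docs$meta)
# a single topic then contrasts Liberal vs. Conservative vocabulary:
plot.STM(content.fit, type = "perspectives", topics = 3)
```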

9.4.5 Interactive Model Overview: LDAvis()

Finally, Sievert and Shirley’s (2014) LDAvis package is an excellent tool for taking a high-level tour of your topic model’s output.

Each topic is represented by a circle whose size reflects the topic’s overall prevalence. Topics are placed in the display by a multi-dimensional scaling algorithm applied to the KL-divergence between their distributions over words. This means that topics that are close to one another in the display use more similar words than topics that are far apart.

On the right is a list of the top 30 (by default) terms overall. If you click on a topic, this list updates to the top 30 terms in the selected topic, showing each word’s frequency in the topic as well as in the corpus as a whole. By default, the lists are ordered by raw word frequency. In the upper right is a slider for lambda, which trades off between raw word frequency within a topic (lambda = 1) and frequency adjusted for the word’s overall corpus frequency (lambda = 0). The authors recommend (based on testing with human coders) a value of 0.6 for most uses.

Setting out.dir will write the javascript and html necessary to run LDAvis to the directory you specify. You can also leave this argument out, and the visualization will run as a local server in the “Viewer” pane of RStudio.

toLDAvis(poliblog.fit, 
         docs=prepped.docs$documents, 
         out.dir = "poliblog_LDAvis")


9.5 Investigating Covariate/Topic Relationships

estimateEffect and plot.estimateEffect are tools in STM for investigating how our topics relate to covariates of interest.

The first step is to estimate model effects from our fit object. The estimateEffect function takes a formula (with the topics of interest as the outcome variable), the model fit to predict from, the metadata, and the type of uncertainty estimate to use. The default uncertainty setting, “Global,” incorporates topic-proportion uncertainty into the displayed uncertainty estimates.

We’ll run this using the same formula with which we fit the model so we can recover both the rating and time dimensions.

fit.effects <- estimateEffect(1:20 ~ rating + s(day), 
                              poliblog.fit, 
                              meta = prepped.docs$meta, 
                              uncertainty = "Global")
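Besides plotting, we can inspect the underlying regression tables directly: summary() on an estimateEffect object prints the covariate coefficient estimates for each topic. Restricting it to one topic keeps the output manageable.

```r
# coefficient table for the effect of rating and s(day) on topic 1's prevalence
summary(fit.effects, topics = 1)
```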

9.5.1 Binary Categories

We could first get an overview of the topics on the liberal-conservative dimension.

plot.estimateEffect(fit.effects, 
                    covariate = "rating", 
                    topics = 1:20,
                    model = poliblog.fit, 
                    method = "difference",
                    cov.value1 = "Liberal", cov.value2 = "Conservative",
                    xlab = "More Conservative ... More Liberal",
                    main = "Effect of Liberal vs. Conservative",
                    xlim = c(-.1, .1), 
                    labeltype = "frex", # use stm's "frex" to rank top words
                    n = 3, # only list the top 3 words
                    verbose.labels = F, # labels get spammy with this T
                    width = 35 # allow for longer label strings
                    )

Interesting that the “meta-media” topic, 3, is the most partisan. Perhaps this is because one of the blogs in particular most often writes about television broadcasts?

We can run a new estimateEffect where we look at variation in just this topic by blog name. Note that we can use a different formula in estimateEffect than the one with which we fit the stm model. That lets us look at how topic prevalence varies with a covariate, even if that covariate was not part of our model structure.

blog.effects <- estimateEffect(c(3) ~ blog, 
                              poliblog.fit, 
                              meta = prepped.docs$meta, 
                              uncertainty = "Global")
plot.estimateEffect(blog.effects, 
                    covariate = "blog", 
                    topics = 3,
                    model = poliblog.fit, 
                    method = "pointestimate",
                    verbose.labels=T)

As we get closer to publishing, we probably want to clean up our graph and add custom labels using the labeltype = “custom” feature of plot.estimateEffect.

plot.estimateEffect(fit.effects, 
                    covariate = "rating", 
                    topics = c(1, 10, 18, 13, 17, 20),
                    model = poliblog.fit, 
                    method = "difference",
                    cov.value1 = "Liberal", cov.value2 = "Conservative",
                    xlab = "More Conservative ... More Liberal",
                    main = "Topic Prevalence Liberal vs. Conservative",
                    xlim = c(-.1, .1), 
                    labeltype = "custom",
                    custom.labels = c('Obama', 'Israel', 
                                      'Immigration',
                                      'Hillary Clinton',
                                      'McCain/Palin',
                                      'Bush'))

9.5.2 Continuous Covariates

We can also plot against a continuous variable (such as the “day” variable in poliblog) using method = “continuous”.

plot.estimateEffect(fit.effects, 
                    covariate = "day", 
                    method = "continuous", 
                    topics = c(1,17),
                    model = poliblog.fit, 
                    printlegend = TRUE, 
                    xlab = "Time (2008)")
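Since “day” is just a day counter, the x-axis labels above aren’t very readable. A sketch of one way to relabel them with month names, assuming (as in the poliblog data) that day counts from the start of 2008: suppress the default axis and draw our own with base R’s axis().

```r
# redraw the continuous plot, suppressing the default x-axis (xaxt is
# passed through to plot), then add month labels by hand
plot.estimateEffect(fit.effects,
                    covariate = "day",
                    method = "continuous",
                    topics = c(1, 17),
                    model = poliblog.fit,
                    printlegend = TRUE,
                    xaxt = "n",
                    xlab = "Time (2008)")
monthseq <- seq(from = as.Date("2008-01-01"), to = as.Date("2008-12-01"),
                by = "month")
axis(1, at = as.numeric(monthseq) - min(as.numeric(monthseq)),
     labels = months(monthseq))
```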

9.5.3 Interactions in Prevalence Covariates

This looks a lot like the conservative/liberal frequency graphs we saw when using poliblog for supervised learning – although it’s Topic 1 (Obama) that maps to the conservative frequency and Topic 17 (McCain) that looks like the liberal frequency.

To investigate that further, we want to allow prevalence to change not only over time and by liberal/conservative, but to allow those two categories to have different temporal patterns in prevalence. STM lets us specify this by setting an interaction in the prevalence formula, the same way we would in a regression model. Running this with “Global” uncertainty can take a while, so here we’ll run just a couple of topics of interest.

First, we fit the estimateEffects for the interaction formulation.

interact.effects <- estimateEffect(c(1,17) ~ rating * s(day), 
                       poliblog.fit, 
                       metadata = prepped.docs$meta, 
                       uncertainty = "Global")

Then we can take a look at Topic 1

plot.estimateEffect(interact.effects, 
                    topics = 1,
                    covariate = "day", 
                    model = poliblog.fit, 
                    method = "continuous",
                    moderator = "rating", 
                    moderator.value = "Liberal", 
                    linecol = "blue",   
                    printlegend = F)
plot.estimateEffect(interact.effects, 
                    topics = 1,
                    covariate = "day", 
                    model = poliblog.fit,
                    method = "continuous", 
                    moderator = "rating", 
                    moderator.value = "Conservative", 
                    linecol = "red", 
                    add = T,
                    printlegend = F)
legend(150, .02, c("Liberal", "Conservative"), lwd = 2, col = c("blue", "red"))

So the Topic 1 difference isn’t driving the overall difference in liberal-conservative frequency we observed. How about Topic 17?

plot.estimateEffect(interact.effects, 
                    topics = 17,
                    covariate = "day", 
                    model = poliblog.fit, 
                    method = "continuous",
                    moderator = "rating", 
                    moderator.value = "Liberal", 
                    linecol = "blue", 
                    printlegend = F)
plot.estimateEffect(interact.effects, 
                    topics = 17,
                    covariate = "day", 
                    model = poliblog.fit,
                    method = "continuous", 
                    moderator = "rating", 
                    moderator.value = "Conservative", 
                    linecol = "red", 
                    add = T,
                    printlegend = F)
legend(0, .14, c("Liberal", "Conservative"), lwd = 2, col = c("blue", "red"))

So, Liberal blogs appear to rapidly increase their discussion of McCain in the second half of the year.

10 Scaling

In this Lab, we’ll look at two methods for scaling documents or authors in a unidimensional spatial model. Wordscores and Wordfish both seek to extract a single latent dimension of each text author that we consider to be motivating the author’s word choice. Wordscores is a supervised approach, requiring us to specify the pair or set of training documents that will define the dimension. Wordfish is an unsupervised approach that will extract a latent dimension, though as we’ll see, it’s not always the dimension we’re interested in.

We’ll also take a brief look at correspondence analysis, a method that will allow us to fit a multi-dimensional model to a set of documents.

We’ll be using the quanteda package for most of the work herein – it contains implementations of Wordscores, Wordfish, Naive Bayes, and Correspondence Analysis.

library(quanteda)

10.1 Wordscores

We’ll start with the simple, non-parametric Wordscores approach.

First, let’s take a look at U.S. presidential inaugural speeches during the 20th Century. The full corpus of speeches is included in the quanteda package.

summary(data_corpus_inaugural)
## Corpus consisting of 58 documents:
## 
##             Text Types Tokens Sentences Year  President       FirstName
##  1789-Washington   625   1538        23 1789 Washington          George
##  1793-Washington    96    147         4 1793 Washington          George
##       1797-Adams   826   2578        37 1797      Adams            John
##   1801-Jefferson   717   1927        41 1801  Jefferson          Thomas
##   1805-Jefferson   804   2381        45 1805  Jefferson          Thomas
##     1809-Madison   535   1263        21 1809    Madison           James
##     1813-Madison   541   1302        33 1813    Madison           James
##      1817-Monroe  1040   3680       121 1817     Monroe           James
##      1821-Monroe  1259   4886       129 1821     Monroe           James
##       1825-Adams  1003   3152        74 1825      Adams     John Quincy
##     1829-Jackson   517   1210        25 1829    Jackson          Andrew
##     1833-Jackson   499   1269        29 1833    Jackson          Andrew
##    1837-VanBuren  1315   4165        95 1837  Van Buren          Martin
##    1841-Harrison  1896   9144       210 1841   Harrison   William Henry
##        1845-Polk  1334   5193       153 1845       Polk      James Knox
##      1849-Taylor   496   1179        22 1849     Taylor         Zachary
##      1853-Pierce  1165   3641       104 1853     Pierce        Franklin
##    1857-Buchanan   945   3086        89 1857   Buchanan           James
##     1861-Lincoln  1075   4006       135 1861    Lincoln         Abraham
##     1865-Lincoln   360    776        26 1865    Lincoln         Abraham
##       1869-Grant   485   1235        40 1869      Grant      Ulysses S.
##       1873-Grant   552   1475        43 1873      Grant      Ulysses S.
##       1877-Hayes   831   2716        59 1877      Hayes   Rutherford B.
##    1881-Garfield  1021   3212       111 1881   Garfield        James A.
##   1885-Cleveland   676   1820        44 1885  Cleveland          Grover
##    1889-Harrison  1352   4722       157 1889   Harrison        Benjamin
##   1893-Cleveland   821   2125        58 1893  Cleveland          Grover
##    1897-McKinley  1232   4361       130 1897   McKinley         William
##    1901-McKinley   854   2437       100 1901   McKinley         William
##   1905-Roosevelt   404   1079        33 1905  Roosevelt        Theodore
##        1909-Taft  1437   5822       159 1909       Taft  William Howard
##      1913-Wilson   658   1882        68 1913     Wilson         Woodrow
##      1917-Wilson   549   1656        59 1917     Wilson         Woodrow
##     1921-Harding  1169   3721       148 1921    Harding       Warren G.
##    1925-Coolidge  1220   4440       196 1925   Coolidge          Calvin
##      1929-Hoover  1090   3865       158 1929     Hoover         Herbert
##   1933-Roosevelt   743   2062        85 1933  Roosevelt     Franklin D.
##   1937-Roosevelt   725   1997        96 1937  Roosevelt     Franklin D.
##   1941-Roosevelt   526   1544        68 1941  Roosevelt     Franklin D.
##   1945-Roosevelt   275    647        26 1945  Roosevelt     Franklin D.
##      1949-Truman   781   2513       116 1949     Truman        Harry S.
##  1953-Eisenhower   900   2757       119 1953 Eisenhower       Dwight D.
##  1957-Eisenhower   621   1931        92 1957 Eisenhower       Dwight D.
##     1961-Kennedy   566   1566        52 1961    Kennedy         John F.
##     1965-Johnson   568   1723        93 1965    Johnson   Lyndon Baines
##       1969-Nixon   743   2437       103 1969      Nixon Richard Milhous
##       1973-Nixon   544   2012        68 1973      Nixon Richard Milhous
##      1977-Carter   527   1376        52 1977     Carter           Jimmy
##      1981-Reagan   902   2790       128 1981     Reagan          Ronald
##      1985-Reagan   925   2921       123 1985     Reagan          Ronald
##        1989-Bush   795   2681       141 1989       Bush          George
##     1993-Clinton   642   1833        81 1993    Clinton            Bill
##     1997-Clinton   773   2449       111 1997    Clinton            Bill
##        2001-Bush   621   1808        97 2001       Bush       George W.
##        2005-Bush   773   2319       100 2005       Bush       George W.
##       2009-Obama   938   2711       110 2009      Obama          Barack
##       2013-Obama   814   2317        88 2013      Obama          Barack
##       2017-Trump   582   1660        88 2017      Trump       Donald J.
## 
## Source:  Gerhard Peters and John T. Woolley. The American Presidency Project.
## Created: Tue Jun 13 14:51:47 2017
## Notes:   http://www.presidency.ucsb.edu/inaugurals.php
# let's look at 20th- and 21st-century presidents
inaug.dfm <- dfm(corpus_subset(data_corpus_inaugural, Year>1901), remove_punct=TRUE)
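Before fitting anything, it’s worth a quick sanity check on the document-feature matrix we just built; dim() and quanteda’s topfeatures() give its dimensions and the most frequent features overall.

```r
# how many documents and features did we end up with?
dim(inaug.dfm)
# the ten most frequent features across the corpus (mostly stopwords,
# since we did not remove them)
topfeatures(inaug.dfm, 10)
```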

For Wordscores, we need to specify at least two anchors that will form the end points of our single-dimension reduction of the data. Let’s start with Roosevelt’s 1937 speech as our left-most point and Reagan’s first speech as our right-most.

docnames(inaug.dfm)
##  [1] "1905-Roosevelt"  "1909-Taft"       "1913-Wilson"    
##  [4] "1917-Wilson"     "1921-Harding"    "1925-Coolidge"  
##  [7] "1929-Hoover"     "1933-Roosevelt"  "1937-Roosevelt" 
## [10] "1941-Roosevelt"  "1945-Roosevelt"  "1949-Truman"    
## [13] "1953-Eisenhower" "1957-Eisenhower" "1961-Kennedy"   
## [16] "1965-Johnson"    "1969-Nixon"      "1973-Nixon"     
## [19] "1977-Carter"     "1981-Reagan"     "1985-Reagan"    
## [22] "1989-Bush"       "1993-Clinton"    "1997-Clinton"   
## [25] "2001-Bush"       "2005-Bush"       "2009-Obama"     
## [28] "2013-Obama"      "2017-Trump"
# so, we want the 9th document and the 20th
# scores tells wordscores the end points of the scale
fit <- textmodel_wordscores(inaug.dfm[c(9,20)],y=c(-1,1))

# Then we can apply our fitted model to the full data set
word.scores <- predict(fit, newdata=inaug.dfm)
# see below regarding the "@" symbol
ws.data.frame <- word.scores@textscores
ggplot(ws.data.frame, aes(x=rownames(ws.data.frame), y=textscore_raw)) +
    geom_errorbar(aes(ymin=textscore_raw_lo, ymax=textscore_raw_hi), width=.1) +
    geom_point() +
    theme(axis.text.x=element_text(angle=45, hjust=1))

Note the message from predict that only a subset of the features can be scored. This means that only that subset of words appears in one of the training documents – the remaining words in the feature set appear in other documents, but not in the two we trained on.

It’s likely that Wordscores will put a lot of emphasis on words that appear in one of the training docs but not in the other. In our case, Roosevelt’s speech contains 683 distinct words and Reagan’s 843, but they share only 256 of the 1,270 distinct words between them.

# distinct words in Roosevelt
sum(as.numeric(inaug.dfm[9,])>0)
## [1] 683
# distinct words in Reagan
sum(as.numeric(inaug.dfm[20,])>0)
## [1] 843
# distinct words in both
sum(as.numeric(inaug.dfm[20,])>0 & as.numeric(inaug.dfm[9,])>0)
## [1] 256
# distinct words in either
sum(as.numeric(inaug.dfm[20,])>0 | as.numeric(inaug.dfm[9,])>0)
## [1] 1270

10.1.1 Side Note: the “@” Symbol

Above, we accessed the results from Wordscores using the @ symbol instead of the usual $ in word.scores@textscores. Many R objects are built with S3, which mimics some of the properties of an object-oriented class system. However, when full class functionality is required, objects are built in S4 (deployed through the methods package). S3 objects (which include lists and data.frames) use the $ notation, while S4 objects use @. If you’re interested in the differences between the two approaches, this presentation by Friedrich Leisch offers a good orientation.
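A minimal illustration of the difference (the class and slot names here are made up for the example):

```r
# S3: just a list with a class attribute; elements are accessed with $
s3.obj <- structure(list(score = 0.5), class = "myscore")
s3.obj$score
## [1] 0.5

# S4: a formal class defined with setClass(); slots are accessed with @
setClass("MyScore", representation(score = "numeric"))
s4.obj <- new("MyScore", score = 0.5)
s4.obj@score
## [1] 0.5
```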

10.1.2 Varying the End Points

Let’s see what happens if we anchor the scale with other end points. Maybe there was something idiosyncratic about this pair.

It will be easier to try out different end points if we wrap the call in a function.

wordScoreMe <- function(df, indices, y=c(-1,1)) {
  fit <- textmodel_wordscores(df[indices,], y=y)
  
  # Then we can apply our algorithm to the full data set
  word.scores <- predict(fit, newdata=df)
  ws.data.frame <- word.scores@textscores
  ggplot(ws.data.frame, aes(x=rownames(ws.data.frame), y=textscore_raw)) +
      geom_errorbar(aes(ymin=textscore_raw_lo, ymax=textscore_raw_hi), width=.1) +
      geom_point() +
      theme(axis.text.x=element_text(angle=45, hjust=1))  
}

And now we can easily produce the scaled graph for different end points.

# how about the carter speech (19) and reagan's first (20)?
wordScoreMe(inaug.dfm, c(19,20))

# reagan and obama?
wordScoreMe(inaug.dfm, c(27,21))

Each time, we see the same pattern: our two specified end points (the training documents) are placed at the extremes, and everything else clusters close to zero. This is driven in part by two factors that point to the sorts of corpora that spatial models work better with.

10.1.3 Wordscore With Lots of Data

First, these documents are relatively short for each author, so there is only a small number of overlapping words between them. This gives the model little to work with in pulling out a dimension of difference, and it will tend to emphasize the small set of words that distinguish the particular pair (or, in the case of Wordfish, a single outlier, as we’ll see below). What Wordscores is essentially doing is ranking documents by how similar they are to the training documents at either end of the spectrum. Try it out with a larger dataset.

10.1.4 Wordscores with a Narrow Discourse

The other issue is that although these Presidents are all speaking at the same formal event, they are potentially talking about very different things. Maybe there’s not enough consistency in topical focus across the inauguration speeches to map them onto a single dimension. If we have a very narrow scope of potential discourse, then we may be more likely to find a reliable mapping onto a single dimension, even without lots of words per author. We’ve seen one example of such a narrowly-circumscribed set of documents: the 2010 set of budget speeches by Irish parliamentary leaders used in Lowe and Benoit (2013).

summary(data_corpus_irishbudget2010)    
## Corpus consisting of 14 documents:
## 
##                                   Text Types Tokens Sentences year debate
##        2010_BUDGET_01_Brian_Lenihan_FF  1953   8641       374 2010 BUDGET
##       2010_BUDGET_02_Richard_Bruton_FG  1040   4446       217 2010 BUDGET
##         2010_BUDGET_03_Joan_Burton_LAB  1624   6393       307 2010 BUDGET
##        2010_BUDGET_04_Arthur_Morgan_SF  1595   7107       343 2010 BUDGET
##          2010_BUDGET_05_Brian_Cowen_FF  1629   6599       250 2010 BUDGET
##           2010_BUDGET_06_Enda_Kenny_FG  1148   4232       153 2010 BUDGET
##      2010_BUDGET_07_Kieran_ODonnell_FG   678   2297       133 2010 BUDGET
##       2010_BUDGET_08_Eamon_Gilmore_LAB  1181   4177       201 2010 BUDGET
##     2010_BUDGET_09_Michael_Higgins_LAB   488   1286        44 2010 BUDGET
##        2010_BUDGET_10_Ruairi_Quinn_LAB   439   1284        59 2010 BUDGET
##      2010_BUDGET_11_John_Gormley_Green   401   1030        49 2010 BUDGET
##        2010_BUDGET_12_Eamon_Ryan_Green   510   1643        90 2010 BUDGET
##      2010_BUDGET_13_Ciaran_Cuffe_Green   442   1240        45 2010 BUDGET
##  2010_BUDGET_14_Caoimhghin_OCaolain_SF  1188   4044       176 2010 BUDGET
##  number      foren     name party
##      01      Brian  Lenihan    FF
##      02    Richard   Bruton    FG
##      03       Joan   Burton   LAB
##      04     Arthur   Morgan    SF
##      05      Brian    Cowen    FF
##      06       Enda    Kenny    FG
##      07     Kieran ODonnell    FG
##      08      Eamon  Gilmore   LAB
##      09    Michael  Higgins   LAB
##      10     Ruairi    Quinn   LAB
##      11       John  Gormley Green
##      12      Eamon     Ryan Green
##      13     Ciaran    Cuffe Green
##      14 Caoimhghin OCaolain    SF
## 
## Source:  /Users/kbenoit/Dropbox (Personal)/GitHub/quanteda/* on x86_64 by kbenoit
## Created: Wed Jun 28 22:04:18 2017
## Notes:
# let's fix up the doc names so they'll fit on graphs
new.ie.corp <- data_corpus_irishbudget2010
docnames(new.ie.corp) <- paste0(new.ie.corp$documents$party, "_", new.ie.corp$documents$name)
ie.dfm <- dfm(new.ie.corp)
docnames(ie.dfm)
##  [1] "FF_Lenihan"    "FG_Bruton"     "LAB_Burton"    "SF_Morgan"    
##  [5] "FF_Cowen"      "FG_Kenny"      "FG_ODonnell"   "LAB_Gilmore"  
##  [9] "LAB_Higgins"   "LAB_Quinn"     "Green_Gormley" "Green_Ryan"   
## [13] "Green_Cuffe"   "SF_OCaolain"

Let’s start by pinning down our dimension to the endpoints that Wordfish identified (in Lowe and Benoit (2013)): FF’s Cowen (document 5) and Labor’s Burton (document 3).

wordScoreMe(ie.dfm, c(3,5))

Okay, now we’re getting somewhere. The very narrow scope of a particular budget debate means that even with relatively short texts, we’re able to capture the split between government (FF and Green) and the opposition. As in their paper, though, Sinn Fein is quite close to the middle, despite having been human-coded as the most extreme.

How does it look if we pin down on the human-coded extremes?

# Sinn Fein's O'Caolain
wordScoreMe(ie.dfm, c(14,5))

# Sinn Fein's Morgan
wordScoreMe(ie.dfm, c(4,5))

Clearly choice of end points can make a large difference in the spatial positions of authors – great care needs to be taken to validate both the stability of the results and the interpretation that the dimension mapped represents some theoretical dimension of interest.
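One quick, informal stability check is to extract the raw scores under two different anchor pairs and correlate them. The helper below is a sketch, using the same old-quanteda @textscores accessor as the rest of this lab; scoreVector is our own hypothetical name, not a quanteda function.

```r
# sketch: how correlated are the scalings from two different anchor pairs?
scoreVector <- function(df, indices, y = c(-1, 1)) {
  fit <- textmodel_wordscores(df[indices, ], y = y)
  predict(fit, newdata = df)@textscores$textscore_raw
}
# Burton/Cowen anchors vs. Morgan/Cowen anchors
cor(scoreVector(ie.dfm, c(3, 5)), scoreVector(ie.dfm, c(4, 5)))
```

A high correlation would suggest the recovered dimension is robust to the anchor choice; a low one would be a warning sign.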

10.2 Wordfish

Wordfish is a fully unsupervised model for scaling a set of documents (or individuals) on a single dimension without having to specify what that dimension is. Since we know from Lowe and Benoit what Wordfish should produce with the Irish budget speeches, let’s start with them.

Rather than specifying the end points, we only need to specify the orientation – choosing one document to be “less than” another. textmodel_wordfish() (in quanteda) will default to the first and second document if we don’t specify a pair.
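For instance, in the quanteda version used here, the dir argument takes the indices of a “low” document and a “high” document; to orient the scale so that FF’s Cowen (document 5) scores below Labour’s Burton (document 3), we could write something like:

```r
# identify the dimension so that document 5 (Cowen) < document 3 (Burton)
wf.oriented <- textmodel_wordfish(ie.dfm, dir = c(5, 3))
```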

wf.fit <- textmodel_wordfish(ie.dfm)
summary(wf.fit)
## Call:
##  textmodel_wordfish.dfm(x = ie.dfm)
## 
## Estimated document positions:
##                    theta         SE       lower       upper
## FF_Lenihan    -1.8209409 0.02032313 -1.86077421 -1.78110753
## FG_Bruton      0.5932874 0.02818872  0.53803755  0.64853732
## LAB_Burton     1.1136883 0.01540288  1.08349862  1.14387791
## SF_Morgan      0.1219197 0.02846356  0.06613111  0.17770827
## FF_Cowen      -1.7724139 0.02364057 -1.81874944 -1.72607841
## FG_Kenny       0.7145832 0.02650303  0.66263729  0.76652918
## FG_ODonnell    0.4844867 0.04171532  0.40272465  0.56624872
## LAB_Gilmore    0.5616606 0.02967426  0.50349904  0.61982214
## LAB_Higgins    0.9703181 0.03850628  0.89484574  1.04579036
## LAB_Quinn      0.9589299 0.03892461  0.88263771  1.03522218
## Green_Gormley -1.1807223 0.07221382 -1.32226144 -1.03918325
## Green_Ryan    -0.1866513 0.06294146 -0.31001658 -0.06328606
## Green_Cuffe   -0.7422019 0.07245392 -0.88421153 -0.60019219
## SF_OCaolain    0.1840564 0.03666329  0.11219639  0.25591648

We’ll have to pull out the elements for our plot one at a time. To make this easier to reproduce, let’s put together a function that goes from model to plot. To do so, we’ll need to extract the features we want into a data.frame using the S4 notation.

wordFishMe <- function(df) {
  wf.fit <- textmodel_wordfish(df)
  wf.df <- data.frame(name=wf.fit@docs,
                      theta=wf.fit@theta,
                      se=wf.fit@se.theta)
  
  ggplot(wf.df, aes(x=name, y=theta)) + 
        geom_errorbar(aes(ymin=theta - 1.96 * se, 
                          ymax=theta + 1.96 * se), 
                      width=.1) +
        geom_point() +
        theme(axis.text.x=element_text(angle=45, hjust=1))  
}

First, let’s inspect the most discriminating words at each end of the dimension; then we can run our plotting function for the Irish budget speeches from Lowe and Benoit.

df <- data.frame(word = wf.fit@features, psi=wf.fit@beta) # note: @beta holds the word discrimination parameters; the column name "psi" here is a misnomer
head(arrange(df, psi), n=20)
##            word       psi
## 1       shortly -2.882330
## 2    innovation -2.852359
## 3    day-to-day -2.773505
## 4       summary -2.746085
## 5     stabilise -2.708261
## 6        sought -2.641552
## 7  arrangements -2.641552
## 8     intention -2.593868
## 9    innovative -2.550237
## 10       fallen -2.550237
## 11     retrofit -2.550237
## 12         base -2.492989
## 13     maintain -2.492989
## 14     electric -2.492989
## 15          vrt -2.492989
## 16        phase -2.453319
## 17      concern -2.415608
## 18   repayments -2.415608
## 19      125,000 -2.415608
## 20    treatment -2.415608
head(arrange(df, desc(psi)), n=20)
##           word      psi
## 1      sharing 4.741609
## 2     medicine 4.741609
## 3         gang 4.741609
## 4      bailout 4.366983
## 5         post 4.366983
## 6    graduates 3.912543
## 7    consensus 3.904842
## 8        one's 3.904842
## 9  apprentices 3.904842
## 10        soft 3.904842
## 11     lifting 3.904842
## 12       awful 3.904842
## 13     trolley 3.904842
## 14        game 3.904842
## 15      sunday 3.904842
## 16    borrowed 3.904842
## 17        echo 3.904842
## 18         men 3.641201
## 19   societies 3.547394
## 20  remarkable 3.292733
wordFishMe(ie.dfm)

Alright, our results look like Lowe and Benoit’s – including placing Sinn Fein, human-coded as the most anti-budget party, in the very center of the scaled dimension.

Let’s see if Wordfish does any better with our inauguration speeches than Wordscore did.

wordFishMe(inaug.dfm)

We’re able to pull out a lot more variation than we were with Wordscores. However, the meaning of the dimension isn’t entirely clear. There’s a distinct pre-war, intra-war, and post-war split, so one hypothesis might be that we are capturing a temporal dimension. The discontinuities also map onto the shift to radio broadcast with Roosevelt and then to television broadcast with Kennedy, so that might be it as well.

Maybe if we narrow our scope to post-Eisenhower, we’ll be able to recover a political dimension?

inaug.post.war <- dfm(corpus_subset(data_corpus_inaugural, Year>1960))
wordFishMe(inaug.post.war)

Two things are worth noting here. First, the pattern still appears to be primarily temporal – this time mapping onto a Cold-War versus post-Cold-War split in the inaugurations.

Interestingly, the relative pattern between the included presidents is similar to what we saw between them when fitting the full set. This may be coincidence, or it may indicate that we’re not pinning down what’s driving positioning within this dimension.

10.3 Multiple Dimensions

Rather than imposing a uni-dimensional spatial model on the data, correspondence analysis can allow us to pull out a multi-dimensional spatial model from the texts. The ca package implements this approach. This lab will implement a simple two-dimension CA. To find out more about running it with larger number of spatial dimensions, check out the ca documentation. Conveniently, quanteda has a wrapper function for ca as well.

# set the number of dimensions (nd) to 2
wca <- textmodel_ca(inaug.dfm, nd = 2)

We can compare our first dimension to the dimension produced by Wordfish to see that the two are highly correlated.

wf.fit <- textmodel_wordfish(inaug.dfm)
plot.df <- data.frame(wordfish = wf.fit@theta,
                      ca.dim1 = wca$rowcoord[,1],
                      ca.dim2 = wca$rowcoord[,2],
                      name = wf.fit@docs)
ggplot(plot.df, aes(x=wordfish, y=ca.dim1, label=name)) +
  geom_text(hjust=0) +
  geom_point()

cor(wf.fit@theta, wca$rowcoord[,1])
## [1] 0.9363208

So, our first dimension in CA is mostly picking up the same dimension as Wordfish. What is our second dimension getting?

ggplot(plot.df, aes(x=ca.dim1, y=ca.dim2, label=name)) +
  geom_text(hjust=0) +
  geom_point()

Taftness and Hardingness. Let’s zoom in on the presidents who aren’t defining a dimension and see how Taftness and Hardingness map onto what we know about them.

ggplot(plot.df, aes(x=ca.dim1, y=ca.dim2, label=name)) +
  geom_text(hjust=0) +
  geom_point() +
  xlim(-1,1.1) + ylim(-0.8,0.8)
## Warning: Removed 4 rows containing missing values (geom_text).
## Warning: Removed 4 rows containing missing values (geom_point).

11 Attribution

This draws on material prepared by Clark Bernier.